This notebook builds and evaluates a Gradient Boosting Classifier for predicting Parkinson's disease from the features provided in the dataset. Here's a summary of what the code does:
Data Exploration: Initially, the code visualizes the distribution of numeric features using Seaborn's histplot function. This helps in understanding the skewness and distribution of each numeric feature.
Correlation Analysis: Next, the code calculates the correlation matrix among the features and visualizes it as a heatmap using Seaborn's heatmap function. This helps in identifying correlations between different features, which can provide insights into potential relationships within the data.
Data Preprocessing: The 'name' column is dropped from the dataset, assuming it's not contributing to the predictive task. Then, the dataset is split into features (X) and the target variable (Y).
Model Training and Evaluation: The dataset is split into training and testing sets using the train_test_split function from scikit-learn. A Gradient Boosting Classifier model is trained on the training data and evaluated on both the training and testing sets.
Evaluation Metrics: Various evaluation metrics such as accuracy, confusion matrix, recall, classification report, and Cohen's Kappa score are calculated and printed to assess the performance of the model on both the training and testing sets.
Pickling: Finally, there's a mention of creating a pickle file, presumably to save the trained model for future use without having to retrain it every time.
Parkinson's Disease: Parkinson's disease is a neurodegenerative disorder that primarily affects movement. It is characterized by symptoms such as tremors, stiffness, slow movements, and impaired balance. Early diagnosis and treatment can help manage symptoms and improve the quality of life for individuals with Parkinson's disease. Machine learning models like the one built in this code can potentially assist in diagnosing Parkinson's disease based on relevant features extracted from patient data.
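The training, evaluation, and pickling steps summarized above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual cells: it runs on synthetic data from `make_classification` (sized to 195 rows and 22 features to resemble the dataset), and the `test_size`, `stratify`, and `random_state` values are assumptions.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, NOT the Parkinson's measurements.
X, Y = make_classification(n_samples=195, n_features=22, random_state=0)

# Hold out 20% for testing (assumed ratio); stratify keeps the class balance.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, Y_train)

# Score both splits so overfitting shows up as a train/test gap.
results = {}
for split, X_part, Y_part in (("train", X_train, Y_train),
                              ("test", X_test, Y_test)):
    pred = model.predict(X_part)
    results[split] = {
        "accuracy": accuracy_score(Y_part, pred),
        "recall": recall_score(Y_part, pred),
        "kappa": cohen_kappa_score(Y_part, pred),
        "confusion": confusion_matrix(Y_part, pred),
    }
    print(split, results[split])
    print(classification_report(Y_part, pred))

# Round-trip through pickle in memory; pickle.dump to a file works the same way.
restored = pickle.loads(pickle.dumps(model))
same_predictions = bool((restored.predict(X_test) == model.predict(X_test)).all())
```

Comparing the train and test entries of `results` side by side is the point of evaluating both splits: a large gap between them suggests the model is memorizing the training data.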
pip install pandas-profiling
Requirement already satisfied: pandas-profiling in c:\users\dhanshree\anaconda3\lib\site-packages (3.6.6)
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import os
print (os.getcwd())
D:\Acmegrade (Internship)\Dec 23 DS Day 18 Project-20240214T033544Z-001\Dec 23 DS Day 18 Project\Projects\Detection of Parkinsons Disease
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r'D:\Acmegrade (Internship)\Dec 23 DS Day 18 Project-20240214T033544Z-001\Dec 23 DS Day 18 Project\Projects\Detection of Parkinsons Disease')
print(os.getcwd())
D:\Acmegrade (Internship)\Dec 23 DS Day 18 Project-20240214T033544Z-001\Dec 23 DS Day 18 Project\Projects\Detection of Parkinsons Disease
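Changing the working directory is one way to locate the data file. A possible alternative, sketched here under assumptions, is to build the file path with `pathlib` and pass it straight to `pd.read_csv`. The temporary folder and the two-row stand-in file below are hypothetical and exist only so the sketch runs anywhere; in practice `project_dir` would point at the real project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the project folder; replace with the real location.
project_dir = Path(tempfile.mkdtemp())
data_path = project_dir / "parkinsons.data"

# Write a tiny stand-in file (two columns only) so the example is self-contained.
data_path.write_text("name,status\nphon_R01_S01_1,1\nphon_R01_S50_2,0\n")

# read_csv accepts a Path object, so no os.chdir call is needed.
df_demo = pd.read_csv(data_path)
print(df_demo.shape)
```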
df=pd.read_csv('parkinsons.data')
display (df)
| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 0.00007 | 0.00370 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.400 | 148.650 | 113.819 | 0.00968 | 0.00008 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.335590 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.01050 | 0.00009 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.08270 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 0.00009 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.10470 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.332180 | 0.410335 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | phon_R01_S50_2 | 174.188 | 230.978 | 94.261 | 0.00459 | 0.00003 | 0.00263 | 0.00259 | 0.00790 | 0.04087 | ... | 0.07008 | 0.02764 | 19.517 | 0 | 0.448439 | 0.657899 | -6.538586 | 0.121952 | 2.657476 | 0.133050 |
| 191 | phon_R01_S50_3 | 209.516 | 253.017 | 89.488 | 0.00564 | 0.00003 | 0.00331 | 0.00292 | 0.00994 | 0.02751 | ... | 0.04812 | 0.01810 | 19.147 | 0 | 0.431674 | 0.683244 | -6.195325 | 0.129303 | 2.784312 | 0.168895 |
| 192 | phon_R01_S50_4 | 174.688 | 240.005 | 74.287 | 0.01360 | 0.00008 | 0.00624 | 0.00564 | 0.01873 | 0.02308 | ... | 0.03804 | 0.10715 | 17.883 | 0 | 0.407567 | 0.655683 | -6.787197 | 0.158453 | 2.679772 | 0.131728 |
| 193 | phon_R01_S50_5 | 198.764 | 396.961 | 74.904 | 0.00740 | 0.00004 | 0.00370 | 0.00390 | 0.01109 | 0.02296 | ... | 0.03794 | 0.07223 | 19.020 | 0 | 0.451221 | 0.643956 | -6.744577 | 0.207454 | 2.138608 | 0.123306 |
| 194 | phon_R01_S50_6 | 214.289 | 260.277 | 77.973 | 0.00567 | 0.00003 | 0.00295 | 0.00317 | 0.00885 | 0.01884 | ... | 0.03078 | 0.04398 | 21.209 | 0 | 0.462803 | 0.664357 | -5.724056 | 0.190667 | 2.555477 | 0.148569 |
195 rows × 24 columns
Matrix column entries (attributes):
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
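Given the attribute list above, separating features from the target amounts to dropping `name` and `status` from the feature matrix. The sketch below uses four rows copied from the displayed table, restricted to a few columns. The `subject` extraction is an assumption: it treats the trailing `_n` in `name` as the recording number, as the attribute description suggests.

```python
import pandas as pd

# Four rows taken from the table output above; only a few columns are kept.
sample = pd.DataFrame({
    "name": ["phon_R01_S01_1", "phon_R01_S01_2",
             "phon_R01_S50_2", "phon_R01_S50_3"],
    "MDVP:Fo(Hz)": [119.992, 122.400, 174.188, 209.516],
    "HNR": [21.033, 19.085, 19.517, 19.147],
    "status": [1, 1, 0, 0],
    "PPE": [0.284654, 0.368674, 0.133050, 0.168895],
})

# 'name' is an identifier, not a measurement, so it is excluded from X.
X = sample.drop(columns=["name", "status"])
Y = sample["status"]

# Class balance: how many recordings carry each status label.
class_counts = Y.value_counts().to_dict()

# Assumed naming scheme: everything before the last underscore is the subject.
subject = sample["name"].str.rsplit("_", n=1).str[0]

print(X.columns.tolist())
print(class_counts)
print(subject.unique().tolist())
```

Because each subject contributes several recordings, the `subject` labels may be useful later for checking whether recordings from the same person end up in both the training and the testing set.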
df.head()
| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 0.00007 | 0.00370 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.400 | 148.650 | 113.819 | 0.00968 | 0.00008 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.335590 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.01050 | 0.00009 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.08270 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 0.00009 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.10470 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.332180 | 0.410335 |
5 rows × 24 columns
df.tail()
| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 190 | phon_R01_S50_2 | 174.188 | 230.978 | 94.261 | 0.00459 | 0.00003 | 0.00263 | 0.00259 | 0.00790 | 0.04087 | ... | 0.07008 | 0.02764 | 19.517 | 0 | 0.448439 | 0.657899 | -6.538586 | 0.121952 | 2.657476 | 0.133050 |
| 191 | phon_R01_S50_3 | 209.516 | 253.017 | 89.488 | 0.00564 | 0.00003 | 0.00331 | 0.00292 | 0.00994 | 0.02751 | ... | 0.04812 | 0.01810 | 19.147 | 0 | 0.431674 | 0.683244 | -6.195325 | 0.129303 | 2.784312 | 0.168895 |
| 192 | phon_R01_S50_4 | 174.688 | 240.005 | 74.287 | 0.01360 | 0.00008 | 0.00624 | 0.00564 | 0.01873 | 0.02308 | ... | 0.03804 | 0.10715 | 17.883 | 0 | 0.407567 | 0.655683 | -6.787197 | 0.158453 | 2.679772 | 0.131728 |
| 193 | phon_R01_S50_5 | 198.764 | 396.961 | 74.904 | 0.00740 | 0.00004 | 0.00370 | 0.00390 | 0.01109 | 0.02296 | ... | 0.03794 | 0.07223 | 19.020 | 0 | 0.451221 | 0.643956 | -6.744577 | 0.207454 | 2.138608 | 0.123306 |
| 194 | phon_R01_S50_6 | 214.289 | 260.277 | 77.973 | 0.00567 | 0.00003 | 0.00295 | 0.00317 | 0.00885 | 0.01884 | ... | 0.03078 | 0.04398 | 21.209 | 0 | 0.462803 | 0.664357 | -5.724056 | 0.190667 | 2.555477 | 0.148569 |
5 rows × 24 columns
import pandas_profiling as pf
display(pf.ProfileReport(df))
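The profile report is interactive. For a quick numeric look at skewness and correlation with the target, a pandas-only sketch such as the one below may help. It uses just the ten rows shown by `df.head()` and `df.tail()` with three feature columns, so the numbers it prints are illustrative only and say nothing reliable about the full 195-row dataset.

```python
import pandas as pd

# Ten rows copied from the df.head() and df.tail() outputs above.
sample = pd.DataFrame({
    "MDVP:Fo(Hz)": [119.992, 122.400, 116.682, 116.676, 116.014,
                    174.188, 209.516, 174.688, 198.764, 214.289],
    "HNR": [21.033, 19.085, 20.651, 20.644, 19.649,
            19.517, 19.147, 17.883, 19.020, 21.209],
    "PPE": [0.284654, 0.368674, 0.332634, 0.368975, 0.410335,
            0.133050, 0.168895, 0.131728, 0.123306, 0.148569],
    "status": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

numeric = sample.select_dtypes("number")

# Skewness per feature: values far from 0 indicate an asymmetric distribution.
skewness = numeric.drop(columns="status").skew()

# Pearson correlation of each feature with the target column.
corr_with_status = numeric.corr()["status"].drop("status")

print(skewness)
print(corr_with_status.sort_values())
```

On the full DataFrame the same two calls would be `df.drop(columns=["name", "status"]).skew()` and `df.drop(columns="name").corr()["status"]`.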